Abstract

We discuss an attentional model for simultaneous object tracking and recognition that 
is driven by gaze data. Motivated by theories of perception, the model consists of 
two interacting pathways: identity and control, intended to mirror the what and where 
pathways in neuroscience models. The identity pathway models object appearance and 
performs classification using deep (factored)-Restricted Boltzmann Machines. At each 
point in time the observations consist of foveated images, with decaying resolution to- 
ward the periphery of the gaze. The control pathway models the location, orientation, 
scale and speed of the attended object. The posterior distribution of these states is esti- 
mated with particle filtering. Deeper in the control pathway, we encounter an attentional 
mechanism that learns to select gazes so as to minimize tracking uncertainty. Unlike in 
our previous work, we introduce gaze selection strategies which operate in the presence
of partial information and on a continuous action space. We show that a straightforward
extension of the existing approach to the partial information setting results in poor per- 
formance, and we propose an alternative method based on modeling the reward surface 
as a Gaussian Process. This approach gives good performance in the presence of par- 
tial information and allows us to expand the action space from a small, discrete set of 
fixation points to a continuous domain. 



1 Introduction 



Humans track and recognize objects effortlessly and efficiently, exploiting attentional 
mechanisms (Rensink, 2000; Colombo, 2001) to cope with the vast stream of data. We
use the human visual system as inspiration to build a system for simultaneous object 
tracking and recognition from gaze data. An attentional strategy is learned online to 
choose fixation points which lead to low uncertainty in the location of the target ob- 
ject. Our tracking system is composed of two interacting pathways. This separation 
of responsibility is a common feature in models from the computational neuroscience 
literature as it is believed to reflect a separation of information processing into ventral 
and dorsal pathways in the human brain (Olshausen et al., 1993a).

The identity pathway (ventral) is responsible for comparing observations of the 
scene to an object template using an appearance model, and on a higher level, for clas- 
sifying the target object. The identity pathway consists of a two hidden layer deep 
network. The top layer corresponds to a multi-fixation Restricted Boltzmann Machine 
(RBM) (Larochelle & Hinton, 2010), as shown in Figure 1. It accumulates information
from the first hidden layers at consecutive time steps. For the first layers, we use
(factored)-RBMs (Hinton & Salakhutdinov, 2006; Ranzato & Hinton, 2010; Welling et al.,
2005; Swersky et al., 2011), but autoencoders (Vincent et al., 2008), sparse coding
(Olshausen & Field, 1996; Kavukcuoglu et al., 2009), two-layer ICA (Köster & Hyvärinen,
2007) and convolutional architectures (Lee et al., 2009) could also be adopted.

The control pathway (dorsal) is responsible for aligning the object template with the 
full scene so the remaining modules can operate independently of the object's position 
and scale. This pathway is separated into a localization module and a fixation module 

Figure 1: From a sequence of gazes $(\mathbf{v}_t, \mathbf{v}_{t+1}, \ldots)$, the model infers the hidden features $\mathbf{h}_t$ for
each gaze (that is, the activation intensity of each receptive field), the hidden features for the
fusion of the sequence of gazes and the object class $c$. Only one time step of classification is kept
in the figure for clarity. The location, size, speed and orientation of the gaze patch are encoded
in the state $\mathbf{x}_t$. The actions follow a learned policy $\pi_t$ that depends on the past rewards
$\{r_1, \ldots, r_{t-1}\}$. This particular reward is a function of the belief state $b_t = p(\mathbf{x}_t \mid \mathbf{a}_{1:t}, \mathbf{h}_{1:t})$, also
known as the filtering distribution. Unlike commonly used partially observed Markov
decision processes (POMDPs), the reward is a function of the beliefs. In this sense, the problem is
closer to one of sequential experimental design. With more layers in the ventral $\mathbf{v} - \mathbf{h} - \mathbf{h}^{[2]} - c$
pathway, other rewards and policies could be designed to implement higher-level attentional
strategies.

which work cooperatively to accomplish this goal. The localization module is implemented
with a particle filter (Doucet et al., 2001) which estimates the location, velocity
and scale of the target object. We make no attempt to implement such states with neural
architectures, but it seems clear that they could be encoded with grid cells (McNaughton
et al., 2006) and retinotopic maps as in V1 and the superior colliculus (Rosa, 2002;
Girard & Berthoz, 2005). The fixation module learns an attentional strategy to select
fixation points relative to the object template. These fixation points are the centres of
partial template observations, and are compared with observations of the corresponding
locations in the scene using the appearance model (see Figure 2). Reward is assigned
to each fixation based on the uncertainty of the target location at each time step. The
fixation module uses the reward signal to adapt its gaze selection policy to achieve good
localization. Our previous work (Bazzani et al., 2010) used Hedge (Auer et al., 1998a;
Freund & Schapire, 1997) to learn this policy. In this extended paper we show that a
straightforward adaptation of our previous approach to the partial information setting
results in poor performance, and we propose an alternative method based on modelling 
the reward surface as a Gaussian Process. This approach gives good performance in the 
presence of partial information and allows us to expand the action space from a small, 
discrete set of fixation points to a continuous domain. 

The proposed system can be motivated from different perspectives. First, starting 
with Isard & Blake (1996), many particle filters have been proposed for image tracking,
but these typically use simple observation models such as B-splines (Isard & Blake,
1996) and colour templates (Okuma et al., 2004). RBMs are more expressive models
of shape, and hence, we conjecture that they will play a useful role where simple
appearance models fail. Second, from a deep learning computational perspective, this 
work allows us to tackle large images and video, which is typically not possible due 
to the number of parameters required to represent large images in deep models. The 
use of fixations synchronized with information about the state (e.g. location and scale) 
of such fixations eliminates the need to look at the entire image or video. Third, the 
system is invariant to image transformations encoded in the state, such as location, 
scale and orientation. Fourth, from a dynamic sensor network perspective, this paper 
presents a very simple, but efficient and novel way of deciding how to gather measure- 
ments dynamically. Lastly, in the context of psychology, the proposed model realizes 

Figure 2: Left: A typical video frame with the estimated target region highlighted. To cope with
the large image size our system considers only the target region at each time step. Centre left: 
A close-up of the template extracted from the first frame. The template is compared to the target 
region by selecting a fixation point for comparison as shown. Centre right: A visualization of a 
single fixation. In addition to covering only a very small portion of the original frame, the image 
is foveated with high resolution near the centre and low resolution on the periphery to further 
reduce the dimensionality. Right: The most active filters of the first layer (factored)-RBM when 
observing the displayed location. The control pathway compares these features to the features 
active at the corresponding scene location in order to update the belief state. 



to some extent the functional architecture for dynamic scene representation of Rensink
(2000). The rate at which different attentional mechanisms develop in newborns (including
alertness, saccades and smooth pursuit, attention to object features and high-level
task driven attention) guided the design of the proposed approach and was a great
source of inspiration (Colombo, 2001).



Our attentional model can be seen as building a saliency map (Koch & Ullman,
1985) over the target template. Previous work on saliency modelling has focused on
identifying salient points in an image using a bottom up process which looks for out- 
liers under some local feature model (which may include a task dependent prior, global 
scene features, or various other heuristics). These features can be computed from static
images (Torralba et al., 2006), or from local regions of spacetime (Gaborski et al., 2004)
for video. Additionally, a wide variety of different feature types have been applied to
this problem, including engineered features (Gao et al., 2007) as well as features that are
learned from data (Zhang et al., 2009). Core to these methods is the idea that saliency
is determined by some type of novelty measure. Our approach is different, in that rather 
than identifying locally or globally novel features, our process identifies features which 
are useful for the task at hand. In our system the saliency signal for a location comes 
from a top down process which evaluates how well the features at that location enable
the system to localize the target object. The work of Gao et al. (2007) considers a similar
approach to saliency by defining saliency to be the mutual information between the
features at a location and the class label of an object being sought; however, in order
to make their model tractable the authors are forced to use specifically engineered features.
Our system is able to cope with arbitrary feature types, and although we focus
only on localization in this paper, our model is sufficiently general to be applied to
identifying salient features for other goals. 

Recently, a dynamic RBM state-space model was proposed in Taylor et al. (2010).
Both the implementation and intention behind that proposal are different from the ap- 
proach discussed here. To the best of our knowledge, our approach is the first successful 
attempt to combine dynamic state estimation from gazes with online policy learning for 
gaze adaptation, using deep network models of appearance. Many other dual-pathway
architectures have been proposed in computational neuroscience, including Olshausen
et al. (1993b) and Postma et al. (1997), but we believe ours has the advantage
that it is very simple, modular (with each module easily replaceable), suitable for large 
datasets and easy to extend. 



2 Identity Pathway 

The identity pathway in our model mirrors the ventral pathway in neuroscience models. 
It is responsible for modelling the appearance of the target object and also, at a higher 
level, for classification. 



2.1 Appearance Model 



We use (factored)-RBMs to model the appearance of objects and perform object classification
using the gazes chosen by the control module (see Figure 3). These undirected
probabilistic graphical models are governed by a Boltzmann distribution over the gaze
data $\mathbf{v}_t$ and the hidden features $\mathbf{h}_t \in \{0, 1\}^{n_h}$. We assume that the receptive fields $\mathbf{W}$,
also known as RBM weights or filters, have been trained beforehand. We also assume
that readers are familiar with these models and, if not, refer them to Ranzato &
Hinton (2010) and Swersky et al. (2010).

Figure 3: An RBM senses a small foveated image derived from the video. The level of activation
of each filter is recorded in the units. The RBM weights (filters) W are visualized in the upper 
left. We currently pre-train these weights. 



2.2 Classification Model 

The identity pathway also performs object recognition, classifying a sequence of gaze 
instances selected with the gaze policy. We implement a multi-fixation RBM very similar
to the one proposed in Larochelle & Hinton (2010), where the binary variables $\mathbf{z}_t$
(see Figure 4) are introduced to encode the relative gaze location within the multi-fixation
RBM (a "1 in K" or "one hot" encoding of the gaze location was used for $\mathbf{z}_t$).

The multi-fixation RBM uses the relative gaze location information in order to aggregate
the first hidden layer representations at $\Delta$ consecutive time steps into a single,
higher level representation $\mathbf{h}^{[2]}_t$.

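More specifically, the energy function of the multi-fixation RBM is:

$$E(\mathbf{h}_{t-\Delta+1:t}, \mathbf{h}^{[2]}_t; \mathbf{z}_{t-\Delta+1:t}) = -\mathbf{d}^\top \mathbf{h}^{[2]}_t - \sum_{i=1}^{\Delta} \mathbf{b}^\top \mathbf{h}_{t-\Delta+i} - \sum_{i=1}^{\Delta} \sum_{f=1}^{F} (\mathbf{P}_{f,:} \mathbf{h}^{[2]}_t)(\mathbf{W}_{f,:} \mathbf{h}_{t-\Delta+i})(\mathbf{V}_{f,:} \mathbf{z}_{t-\Delta+i})$$

where the notation $\mathbf{P}_{f,:}$ refers to the $f$-th row vector of the matrix $\mathbf{P}$. From this energy
function, we define a distribution over $\mathbf{h}_{t-\Delta+1:t}$ and $\mathbf{h}^{[2]}_t$ (conditioned on $\mathbf{z}_{t-\Delta+1:t}$)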

Figure 4: Gaze accumulation and classification in the identity pathway. A multi-fixation RBM
models the conditional distribution (given the gaze positions $\mathbf{a}_t$) of $\Delta$ consecutive hidden features
$\mathbf{h}_t$, extracted by the first layer RBM on the foveated images. In this illustration, $\Delta = 2$.
The multi-fixation RBM encodes the gaze position $\mathbf{a}_t$ in a "one hot" representation noted $\mathbf{z}_t$.
The activation probabilities of the second layer hidden units $\mathbf{h}^{[2]}_t$ are used by a classifier to
predict the object's class.



through the Boltzmann distribution:

$$p(\mathbf{h}_{t-\Delta+1:t}, \mathbf{h}^{[2]}_t \mid \mathbf{z}_{t-\Delta+1:t}) = \frac{\exp\left(-E(\mathbf{h}_{t-\Delta+1:t}, \mathbf{h}^{[2]}_t; \mathbf{z}_{t-\Delta+1:t})\right)}{Z(\mathbf{z}_{t-\Delta+1:t})} \qquad (1)$$

where the normalization constant $Z(\mathbf{z}_{t-\Delta+1:t})$ ensures that Equation 1 sums to 1. To
sample from this distribution, one can use Gibbs sampling by alternating between sampling
the top-most hidden layer $\mathbf{h}^{[2]}_t$ given all individual processed gazes $\mathbf{h}_{t-\Delta+1:t}$ and
vice versa. To train the multi-fixation RBM, we collect a training set consisting of sequences
of $\Delta$ pairs $(\mathbf{h}_t, \mathbf{z}_t)$ by randomly selecting $\Delta$ gaze positions at which to fixate
and computing the associated $\mathbf{h}_t$. These sets are extracted from a collection of images
in which the object to detect has been centred. Unsupervised learning using contrastive
divergence can then be performed on this training set. See Larochelle & Hinton (2010)
for more details.

The main difference between this multi-fixation RBM and the one described in
Larochelle & Hinton (2010) is that ours does not explicitly model the class label $c_t$.
Instead, a multinomial logistic regression classifier is trained separately to predict $c_t$
from the aggregated representation extracted from $\mathbf{h}^{[2]}_t$. More specifically, we use the
vector of activation probabilities of all hidden units $h^{[2]}_{t,j}$ in $\mathbf{h}^{[2]}_t$, conditioned on $\mathbf{h}_{t-\Delta+1:t}$
and $\mathbf{z}_{t-\Delta+1:t}$, as the aggregated representation:

$$p(h^{[2]}_{t,j} = 1 \mid \mathbf{h}_{t-\Delta+1:t}, \mathbf{z}_{t-\Delta+1:t}) = \mathrm{sigm}\left( d_j + \sum_{i=1}^{\Delta} \sum_{f=1}^{F} P_{f,j} (\mathbf{W}_{f,:} \mathbf{h}_{t-\Delta+i})(\mathbf{V}_{f,:} \mathbf{z}_{t-\Delta+i}) \right)$$

where $\mathrm{sigm}(x) = 1/(1 + e^{-x})$ is the logistic sigmoid.

We experimented with a single fixation module, but found the multi-fixation module to
increase classification accuracy. To improve the estimate of the class variable $c_t$ over time,
we accumulate the classification decisions at each time step.
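
As an illustration, the aggregated representation described above can be computed in a few lines of NumPy. This is a minimal sketch rather than the implementation used in our experiments: the function name `aggregate_gazes`, the array shapes and the bias vector `d` are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggregate_gazes(H, Z, P, W, V, d):
    """Activation probabilities of the second-layer hidden units.

    H: (Delta, n_h)  first-layer features, one row per gaze
    Z: (Delta, n_z)  one-hot gaze location codes
    P: (F, n_h2), W: (F, n_h), V: (F, n_z)  factor matrices (assumed shapes)
    d: (n_h2,)       second-layer biases (assumed name)
    """
    # For each gaze i and factor f: (W_f . h_i) * (V_f . z_i), shape (Delta, F)
    factors = (H @ W.T) * (Z @ V.T)
    # Sum over the Delta gazes, project through P and add the bias
    return sigmoid(d + factors.sum(axis=0) @ P)
```

The returned vector has one entry per second-layer unit and would serve as the input to the separately trained logistic regression classifier.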

Note that the process of pursuit (tracking) is essential to classification. As the target 
is tracked, the algorithm fixates at locations near the target's estimated location. The 
size and orientation of these fixations also depend on the corresponding state estimates.
Note that we don't fixate exactly at the target location estimate as this would provide 
only one distinct fixation over several time steps if the tracking policy has converged 
to a specific gaze. It should also be pointed out that instead of using random fixations, 
one could again use the control strategy proposed in this paper to decide where to look 
with respect to the track estimate so as to reduce classification uncertainty. We leave 
the implementation of this extra attentional mechanism for future work. 



3 Control Pathway 

The control pathway mirrors the responsibility of the dorsal pathway in human visual 
processing. It tracks the state of the target (position, speed, etc) and normalizes the 
input so that other modules need not account for these variations. At a higher level 
it is responsible for learning an attentional strategy which maximizes the amount of 
information learned with each fixation. The structure of the control pathway is shown 
in Figure 5.

3.1 State-space model 

The standard approach to image tracking is based on the formulation of Markovian, 
nonlinear, non-Gaussian state-space models, which are solved with approximate Bayesian 

Figure 5: Influence diagram for the control pathway. The true state of the tracked object, $\mathbf{x}_t$,
generates some set of features, $\mathbf{h}_t$, in the identity pathway. These features depend on the action
chosen at time $t$ and are used to update the belief state $b_t$. Statistics of the belief state are
collected to compute the reward $r_t$, which is used to update the policy for the next time step.

filtering techniques. In this setting, the unobserved signal (object's position, velocity,
scale, orientation or discrete set of operations) is denoted $\{\mathbf{x}_t \in \mathcal{X}; t \in \mathbb{N}\}$. This
signal has initial distribution $p(\mathbf{x}_0)$ and transition equation $p(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{a}_{t-1})$. Here
$\mathbf{a}_t \in \mathcal{A}$ denotes an action at time $t$, defined on a compact set $\mathcal{A}$. For discrete policies
$\mathcal{A}$ is finite whereas for continuous policies $\mathcal{A}$ is a region in $\mathbb{R}^2$. The observations
$\{\mathbf{h}_t \in \mathcal{H}; t \in \mathbb{N}^*\}$ are assumed to be conditionally independent given the process state
$\{\mathbf{x}_t; t \in \mathbb{N}\}$. Note that from the state space model perspective the observations are the
hidden units of the second layer of the appearance model in the identity pathway.
In summary, the state-space model is described by the following distributions:

$$\begin{aligned} & p(\mathbf{x}_0) \\ & p(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{a}_{t-1}) && \text{for } t \geq 1 \\ & p(\mathbf{h}_t \mid \mathbf{x}_t, \mathbf{a}_t) && \text{for } t \geq 1. \end{aligned}$$

For the transition model, we will adopt a classical autoregressive process. 
Our aim is to estimate recursively in time the posterior distribution $p(\mathbf{x}_{0:t} \mid \mathbf{h}_{1:t}, \mathbf{a}_{1:t})$
and its associated features, including the marginal distribution $b_t = p(\mathbf{x}_t \mid \mathbf{h}_{1:t}, \mathbf{a}_{1:t})$,
known as the filtering distribution or belief state. This distribution satisfies the following
recurrence:

$$b_t \propto p(\mathbf{h}_t \mid \mathbf{x}_t, \mathbf{a}_t) \int p(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{a}_{t-1}) \, p(d\mathbf{x}_{t-1} \mid \mathbf{h}_{1:t-1}, \mathbf{a}_{1:t-1}).$$

Except for standard distributions (e.g. Gaussian or discrete), this recurrence is intractable.

After learning the observation model we will use it for tracking. The observation
model is often defined in terms of the distance of the observations from a template $\boldsymbol{\tau}$,

$$p(\mathbf{h}_t \mid \mathbf{x}_t, \mathbf{a}_t) \propto \exp\left(-d(\mathbf{h}(\mathbf{x}_t, \mathbf{a}_t), \boldsymbol{\tau})\right),$$

where $d(\cdot, \cdot)$ denotes a distance metric and $\boldsymbol{\tau}$ an object template (for example, a color
histogram or spline). In this model, the observation $\mathbf{h}(\mathbf{x}_t, \mathbf{a}_t)$ is a function of the current
state hypothesis and the selected action. The problem with this approach is eliciting a
good template. Often color histograms or splines are insufficient. For this reason, we
will construct the templates with (factored)-RBMs as follows. First, optical flow is used
to detect new object candidates entering the visual scene. Second, we assign a template
to the detected object candidate, as shown in Figure 2. The same figure also shows a
typical foveated observation (higher resolution in the center and lower in the periphery
of the gaze) and the receptive fields for this observation learned beforehand with an
RBM. The control algorithm will be used to learn which parts of the template are most
informative, either by picking from among a predefined set of fixation points, or by
using a continuous policy. Finally, we define the likelihood of each observation directly
in terms of the distance of the hidden units of the RBM, $\mathbf{h}(\mathbf{x}_t, \mathbf{a}_t, \mathbf{v}_t)$, to the hidden units
of the corresponding template region, $\mathbf{h}(\mathbf{x}_1, \mathbf{a}_t = k, \mathbf{v}_1)$. That is,

$$p(\mathbf{h}_t \mid \mathbf{x}_t, \mathbf{a}_t = k) \propto \exp\left(-d(\mathbf{h}(\mathbf{x}_t, \mathbf{a}_t = k, \mathbf{v}_t), \mathbf{h}(\mathbf{x}_1, \mathbf{a}_t = k, \mathbf{v}_1))\right).$$

The above template is static, but conceivably one could adapt it over time. 
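
The template likelihood above is straightforward to evaluate for a set of state hypotheses. The following sketch assumes a squared Euclidean distance for $d(\cdot, \cdot)$, which is only one possible choice, and the function names and array shapes are hypothetical.

```python
import numpy as np

def log_likelihoods(particle_features, template_features):
    """Unnormalized log p(h_t | x_t, a_t = k), one value per state hypothesis.

    particle_features: (N, n_h) hidden activations observed under each hypothesis
    template_features: (n_h,)   hidden activations of template region k
    The squared Euclidean distance is an assumed choice of d(., .).
    """
    diff = particle_features - template_features
    return -np.sum(diff ** 2, axis=1)

def normalized_weights(log_lik):
    # Subtract the maximum before exponentiating for numerical stability
    w = np.exp(log_lik - np.max(log_lik))
    return w / w.sum()
```

Hypotheses whose features match the template region closely receive the largest normalized weights.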

3.2 Reward Function 

A gaze control strategy specifies a policy $\pi(\cdot)$ for selecting fixation points. The purpose
of this strategy is to select fixation points which maximize an instantaneous reward
function $r_t(\cdot)$. The reward can encode any desired behaviour for the system, such as minimizing
posterior uncertainty or achieving a more abstract goal. We focus on gathering
observations so as to minimize the uncertainty in the estimate of the filtering distribution:
$r_t(\mathbf{a}_t \mid b_t) = u[p(\mathbf{x}_t \mid \mathbf{h}_{1:t}, \mathbf{a}_{1:t})]$. More specifically, as discussed later, this reward
will be a function of the variance of the importance weights of the particle filter approximation
$p(\mathbf{x}_t \mid \mathbf{h}_{1:t}, \mathbf{a}_{1:t})$ of the belief state.

Footnote: We use the notation $\mathbf{x}_{0:t} = \{\mathbf{x}_0, \ldots, \mathbf{x}_t\}$ to represent the past history of a variable over time.

It is also useful to consider the cumulative reward

$$R_T = \sum_{t=1}^{T} r_t(\mathbf{a}_t \mid b_t),$$

which is the sum of the instantaneous rewards which have been received up to time $T$.
The gaze control strategies we consider are all "no-regret" which means that the average
gap between our cumulative reward and the cumulative reward from always picking the
optimal action goes to zero as $T \to \infty$.

In our current implementation, each action is a different gaze location and the ob- 
jective is to choose where to look so as to minimize the uncertainty about the belief 
state. 



4 Gaze control 

We compare several different strategies for learning the gaze selection policy. In an
earlier version of this work (Bazzani et al., 2010) we learned the gaze selection policy
with a portfolio allocation algorithm called Hedge (Freund & Schapire, 1997; Auer
et al., 1998b). Hedge requires knowledge of the rewards for all actions at each time
step, which is not realistic when gazes must be performed sequentially, since the target
object will move between fixations. We compare this strategy, as well as two baseline
methods, to two very different alternatives.



EXP3 is an extension of Hedge to partial information games (Auer et al., 2001).
Unlike Hedge, EXP3 requires knowledge of the reward only for the action selected 
at each time step. EXP3 is more appropriate to the setting at hand, and is also more 
computationally efficient than Hedge; however, this comes at a cost of substantially 
lower theoretical performance. 

Both Hedge and EXP3 learn gaze selection policies which choose among a discrete 
set of predetermined fixation points. We can instead learn a continuous policy by estimating
the reward surface using a Gaussian Process (Rasmussen & Williams, 2006).
By assuming that the reward surface is smooth, we can draw on the tools of Bayesian
optimization (Brochu et al., 2010) to search for the optimal gaze location using as few
exploratory steps as possible. 

The following sections describe each of these approaches in more detail. 



4.1 Baseline 

We consider two baseline strategies, which we call random and circular. The random 
strategy samples gaze selections uniformly from a small discrete set of possibilities. 
The circular strategy also uses a small discrete set of gaze locations and cycles through 
them in a fixed order. 



4.2 Hedge 



To use Hedge (Freund & Schapire, 1997; Auer et al., 1998b) for gaze selection we must
first discretize the action space by selecting a fixed finite number of possible fixation
points. Hedge maintains an importance weight $G(i)$ for each possible fixation point and
uses them to form a stochastic policy at each time step. An action is selected according
to this policy and the reward for each possible action is observed. These rewards are
then used to update the importance weights and the process repeats. Pseudo code for
Hedge is shown in Algorithm 1.

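Algorithm 1 Hedge
Input: $\gamma > 0$
Input: $G_0(i) \leftarrow 0$ for each $i \in \mathcal{A}$
for $t = 1, 2, \ldots$ do
  for $i \in \mathcal{A}$ do
    $p_t(i) \leftarrow \exp(\gamma G_{t-1}(i)) / \sum_{j \in \mathcal{A}} \exp(\gamma G_{t-1}(j))$
  $a_t \sim (p_t(1), \ldots, p_t(|\mathcal{A}|))$  // sample an action from the distribution $(p_t(k))$
  for $i \in \mathcal{A}$ do
    $r_t(i) \leftarrow r_t(i \mid b_t)$
    $G_t(i) \leftarrow G_{t-1}(i) + r_t(i)$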
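
For concreteness, Algorithm 1 can be sketched in Python as follows. The class and method names are illustrative assumptions, and the reward vector passed to `update` stands in for the rewards $r_t(i \mid b_t)$ that the tracker would supply for every fixation point.

```python
import numpy as np

class Hedge:
    """Full-information Hedge over a finite set of fixation points (sketch)."""

    def __init__(self, n_actions, gamma, seed=0):
        self.gamma = gamma
        self.G = np.zeros(n_actions)            # cumulative rewards G_t(i)
        self.rng = np.random.default_rng(seed)

    def policy(self):
        # p_t(i) proportional to exp(gamma * G_{t-1}(i)); shift by the max for stability
        p = np.exp(self.gamma * (self.G - self.G.max()))
        return p / p.sum()

    def select(self):
        # Sample an action from the current stochastic policy
        return int(self.rng.choice(len(self.G), p=self.policy()))

    def update(self, rewards):
        # Full information: one reward is supplied for every action
        self.G += np.asarray(rewards, dtype=float)
```

A tracking loop would call `select()` to pick the fixation and then `update()` with the rewards of all fixation points at that time step.
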
4.3 EXP3



EXP3 (Auer et al., 2001) is a generalization of Hedge to the partial information setting.
In order to maintain estimates for the importance weights, Hedge requires reward infor- 
mation for each possible action at each time step. EXP3 works by wrapping Hedge in 
an outer loop which simulates a fully observed reward vector at each time step. EXP3 
selects actions based on a mixture of the policy found by Hedge and a uniform distribu- 
tion. EXP3 is able to function in the presence of partial information, but this comes at 
the cost of substantially worse theoretical guarantees. Pseudo code for EXP3 is shown 
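in Algorithm 2.

Algorithm 2 EXP3
Input: $\gamma \in (0, 1]$
Initialize Hedge($\gamma$)
for $t = 1, 2, \ldots$ do
  Receive $p_t$ from Hedge
  $\hat{p}_t \leftarrow (1 - \gamma) p_t + \gamma / |\mathcal{A}|$
  $a_t \sim (\hat{p}_t(1), \ldots, \hat{p}_t(|\mathcal{A}|))$
  Simulate reward vector for Hedge where $\hat{r}_t(j) \leftarrow r_t(j) / \hat{p}_t(j)$ if $j = a_t$, and $0$ otherwise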
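
A compact Python sketch of Algorithm 2 is given below. It follows the pseudo code as written, with the same $\gamma$ used for the inner Hedge update and for the uniform mixture; the class name `Exp3` and its methods are assumptions made for the example.

```python
import numpy as np

class Exp3:
    """EXP3 as a wrapper around a Hedge-style update (illustrative sketch)."""

    def __init__(self, n_actions, gamma, seed=0):
        self.K = n_actions
        self.gamma = gamma
        self.G = np.zeros(n_actions)   # Hedge's cumulative simulated rewards
        self.rng = np.random.default_rng(seed)

    def policy(self):
        # Hedge policy mixed with a uniform distribution over the K actions
        p = np.exp(self.gamma * (self.G - self.G.max()))
        p /= p.sum()
        return (1.0 - self.gamma) * p + self.gamma / self.K

    def select(self):
        return int(self.rng.choice(self.K, p=self.policy()))

    def update(self, action, reward):
        # Only the chosen action's reward is observed; importance-weight it
        # to simulate a full reward vector for Hedge
        p_hat = self.policy()
        simulated = np.zeros(self.K)
        simulated[action] = reward / p_hat[action]
        self.G += simulated
```

Unlike the Hedge sketch, `update` needs only the reward of the fixation that was actually performed.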



4.4 Bayesian Optimization 

Both Hedge and EXP3 discretize the space of possible fixation points and learn a dis- 
tribution over this finite set. In contrast, Bayesian optimization is able to treat the space 
of possible fixation points as fully continuous by placing a smoothness prior on how 
reward is expected to vary with location. Intuitively, if we know the reward at one location,
then we expect other, nearby locations to produce similar rewards. Gaussian
Process priors encode this type of belief (Rasmussen & Williams, 2006), and have been
used extensively for optimization of cost functions when it is important to minimize the
total number of function evaluations (Brochu et al., 2010).

We model the latent reward function $r_t(\mathbf{a}_t \mid b_t) = r(\mathbf{a}_t \mid b_t, \theta_t)$ as a zero mean Gaussian
Process

$$r(\mathbf{a}_t \mid b_t, \theta_t) \sim \mathcal{GP}\left(0, k(\mathbf{a}_t, \mathbf{a}'_t \mid b_t, \theta_t)\right),$$

where $b_t$ is the belief state (see Section 3.1), and $\theta_t$ are the model hyperparameters.
The kernel function $k(\cdot, \cdot)$ gives the covariance between the reward at any two gaze
locations. To ease the notation, the explicit dependence of $r(\cdot)$ and $k(\cdot, \cdot)$ on $b_t$ and $\theta_t$
will be dropped.

We assume that the true reward function $r(\cdot)$ is not directly measurable, and what
we observe are measurements of this function corrupted by Gaussian noise. That is, at
each time step the instantaneous reward $r_t$ is given by

$$r_t = r(\mathbf{a}_t) + \sigma_n \delta_n,$$

where $\delta_n \sim \mathcal{N}(0, 1)$ and $\sigma_n$ is a hyperparameter indicating the amount of observation
noise, which we absorb into $\theta_t$.

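Given a set of observations we can compute the posterior predictive distribution for
$r(\cdot)$:

$$r(\mathbf{a} \mid r_{1:t}, \mathbf{a}_{1:t}) \sim \mathcal{N}\left(m_t(\mathbf{a}), s_t^2(\mathbf{a})\right), \qquad (2)$$

where

$$m_t(\mathbf{a}) = \mathbf{k}^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{r}_{1:t}, \qquad s_t^2(\mathbf{a}) = k(\mathbf{a}, \mathbf{a}) - \mathbf{k}^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k},$$

$$\mathbf{K} = \begin{bmatrix} k(\mathbf{a}_1, \mathbf{a}_1) & \cdots & k(\mathbf{a}_1, \mathbf{a}_t) \\ \vdots & \ddots & \vdots \\ k(\mathbf{a}_t, \mathbf{a}_1) & \cdots & k(\mathbf{a}_t, \mathbf{a}_t) \end{bmatrix}, \qquad \mathbf{k} = \begin{bmatrix} k(\mathbf{a}_1, \mathbf{a}) & \cdots & k(\mathbf{a}_t, \mathbf{a}) \end{bmatrix}^\top, \qquad \mathbf{r}_{1:t} = \begin{bmatrix} r_1 & \cdots & r_t \end{bmatrix}^\top.$$
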

It remains to specify the form of the kernel function, $k(\cdot, \cdot)$. We experimented with
several possibilities, but found that the specific form of the kernel function is not critical
to the performance of this method. For the experiments in this paper we used the
squared exponential kernel,

$$k(\mathbf{a}_i, \mathbf{a}_j) = \sigma_m^2 \exp\left( -\frac{1}{2} \sum_{k=1}^{D} \frac{(a_{i,k} - a_{j,k})^2}{\ell_k^2} \right),$$

where $\sigma_m$ and the $\{\ell_1, \ldots, \ell_D\}$ are hyperparameters.
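
To make the regression step concrete, the sketch below evaluates a squared exponential kernel and the standard Gaussian Process posterior mean and variance at a set of query gaze locations. The function names and argument layout are assumptions, the kernel uses the common parameterization with a factor of one half in the exponent, and the hyperparameters are taken as given rather than estimated.

```python
import numpy as np

def se_kernel(A, B, sigma_m, lengthscales):
    """Squared exponential kernel between rows of A (n, D) and B (m, D)."""
    diff = (A[:, None, :] - B[None, :, :]) / lengthscales
    return sigma_m ** 2 * np.exp(-0.5 * np.sum(diff ** 2, axis=2))

def gp_posterior(A_obs, r_obs, A_query, sigma_m, lengthscales, sigma_n):
    """Posterior mean and variance of the reward at each query location."""
    K = se_kernel(A_obs, A_obs, sigma_m, lengthscales)
    K_noisy = K + sigma_n ** 2 * np.eye(len(A_obs))
    k_star = se_kernel(A_obs, A_query, sigma_m, lengthscales)   # shape (t, m)
    # Cholesky factorization avoids forming an explicit matrix inverse
    L = np.linalg.cholesky(K_noisy)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r_obs))
    v = np.linalg.solve(L, k_star)
    mean = k_star.T @ alpha
    var = sigma_m ** 2 - np.sum(v ** 2, axis=0)   # k(a, a) = sigma_m^2
    return mean, var
```

Close to previously visited fixation points the variance collapses towards the noise level, while far from them it returns to the prior magnitude $\sigma_m^2$.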

Equation 2 is a Gaussian Process estimate of the reward surface and can be used to
select a fixation point for the next time step. This estimate gives both a predicted reward 
value and an associated uncertainty for each possible fixation point. This is the strength 
of Gaussian Processes for this type of optimization problem, since the predictions can 
be used to balance exploration (choosing a fixation point where the reward is highly 
uncertain) and exploitation (choosing a point we are confident will have high reward). 

There are many selection methods available in the literature which offer different
tradeoffs between these two criteria. In this paper we use GP-UCB (Srinivas et al.,
2010), which selects

$$\mathbf{a}_{t+1} = \arg\max_{\mathbf{a}} \; m_t(\mathbf{a}) + \sqrt{\beta_t} \, s_t(\mathbf{a}) \qquad (3)$$

where $\beta_t$ is a parameter. The setting $\beta_t = 2\log(t^2 \pi^2 / 3\delta)$ (with $\delta = 0.001$) is used
throughout this paper. 

Equation 3 must still be optimized to find $\mathbf{a}_{t+1}$, which can be performed using standard
global optimization tools. We use DIRECT (Jones et al., 1993) due to the existence
of a readily available implementation. 
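
The selection rule itself is simple once the posterior mean and standard deviation are available. The sketch below maximizes the upper confidence bound over a finite grid of candidate gaze locations; the grid search is a stand-in used only for illustration and is not the DIRECT optimizer, and the function name, grid range and `beta` argument are assumptions.

```python
import numpy as np

def gp_ucb_select(candidates, mean, std, beta):
    """Return the candidate maximizing mean + sqrt(beta) * std, with all scores."""
    scores = mean + np.sqrt(beta) * std
    return candidates[np.argmax(scores)], scores

# Candidate gaze offsets on a coarse grid over [-1, 1]^2 (illustrative range)
xs = np.linspace(-1.0, 1.0, 5)
grid = np.array([(x, y) for x in xs for y in xs])
```

With a small `beta` the rule exploits the location with the highest predicted reward, while a large `beta` favours locations where the reward estimate is most uncertain.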

The Gaussian Process regression is controlled by several hyperparameters (see Figure 6):
$\sigma_m$ controls the overall magnitude of the covariance, and $\sigma_n$ controls the amount
of observation noise. The remaining parameters $\{\ell_1, \ldots, \ell_D\}$ are length scale parameters
which control the range of the covariance effects in each dimension.

Treatment of the hyperparameters requires special consideration in this setting. The 
pure Bayesian approach is to put a prior on each parameter and integrate them out 
of the predictive distribution. However, since the integrals involved are not tractable 
analytically, this requires computationally expensive numerical approximations. Speed 
is an issue here since GP-UCB requires that we optimize a function of the posterior 
process at each time step so, for instance, computing Monte Carlo averages for each 
evaluation of Equation 2 is prohibitively slow.

An alternative approach is to choose parameter values via maximum likelihood. 
This can be done quickly, and allows us to make speedy predictions; however, in this 
case we suffer from problems of data scarcity, particularly early in the tracking process 
when few observations have been made. The length scale parameters are particularly
prone to receiving very poor estimates when there is little data available.

Figure 6: Graphical model for Bayesian optimization. The $\ell_i$ are length scales in each dimension,
$\sigma_m$ is the magnitude parameter and $\sigma_n$ is the noise level. In our model $\sigma_m$ and $\sigma_n$ follow
a uniform prior and the $\ell_i$ follow independent Student-t priors.

We have found that using informative priors for the length scale parameters and 
making MAP, rather than ML, estimates at each time step provides a solution to the 
problems described above. MAP estimates can be made quickly using gradient optimization
methods (Rasmussen & Williams, 2006), and informative priors provide resistance
to the problems encountered with ML. The experiments in Section 6 place
uniform priors on the magnitude and noise parameters and place independent Student-t 
priors on each length scale parameter. The experiments also use an initial data collec- 
tion phase of 10 time steps before any adjustment of the parameters is made. 



5 Algorithm 

Since the belief state cannot be computed analytically, we will adopt particle filtering 
to approximate it. The full algorithm is shown in Algorithm 3.

We refer readers to Doucet et al. (2001) for a more in depth treatment of these
sequential Monte Carlo methods. Assume that at time $t - 1$ we have $N \gg 1$ particles
(samples) $\{\mathbf{x}^{(i)}_{0:t-1}\}_{i=1}^{N}$ distributed according to $p(d\mathbf{x}_{0:t-1} \mid \mathbf{h}_{1:t-1}, \mathbf{a}_{1:t-1})$. We can
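
Algorithm 3 Particle filtering algorithm with gaze control. The algorithm shown here
is for partial information strategies. For full information strategies the importance sampling
step is done independently for each possible action and the gaze control step is
able to use reward information from each possible action to create the new strategy.

1. Initialization
  for $i = 1$ to $N$ do
    $\mathbf{x}^{(i)}_0 \sim p(\mathbf{x}_0)$
  Initialize the policy $\pi_1(\cdot)$  // How this is done depends on the control strategy
for $t = 1, \ldots$ do
  2. Importance sampling
  for $i = 1$ to $N$ do
    // Predict the next state
    $\tilde{\mathbf{x}}^{(i)}_t \sim q_t(d\mathbf{x}^{(i)}_t \mid \mathbf{x}^{(i)}_{0:t-1}, \mathbf{h}_{1:t}, \mathbf{a}_{1:t})$
  // Select an action according to the policy
  $\mathbf{a}_t \sim \pi_t(\cdot)$
  for $i = 1$ to $N$ do
    // Evaluate the importance weights
    $\tilde{w}^{(i)}_t \leftarrow p(\mathbf{h}_t \mid \tilde{\mathbf{x}}^{(i)}_t, \mathbf{a}_t) \, p(\tilde{\mathbf{x}}^{(i)}_t \mid \mathbf{x}^{(i)}_{0:t-1}, \mathbf{a}_{t-1}) \, / \, q_t(\tilde{\mathbf{x}}^{(i)}_t \mid \mathbf{x}^{(i)}_{0:t-1}, \mathbf{h}_{1:t}, \mathbf{a}_{1:t})$
  for $i = 1$ to $N$ do
    // Normalize the importance weights
    $w^{(i)}_t \leftarrow \tilde{w}^{(i)}_t / \sum_{j=1}^{N} \tilde{w}^{(j)}_t$
  3. Gaze control
  // Receive reward for the chosen action
  $r_t \leftarrow r_t(\mathbf{a}_t \mid b_t)$
  Incorporate $r_t$ into the policy to create $\pi_{t+1}(\cdot)$
  4. Selection
  Resample with replacement $N$ particles $(\mathbf{x}^{(i)}_{0:t}; i = 1, \ldots, N)$ from the set
  $(\tilde{\mathbf{x}}^{(i)}_{0:t}; i = 1, \ldots, N)$ according to the normalized importance weights $w^{(i)}_t$.
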
approximate this belief state with the following empirical distribution

$$\hat{p}(dx_{0:t-1} \mid h_{1:t-1}, a_{1:t-1}) = \frac{1}{N} \sum_{i=1}^{N} \delta_{x_{0:t-1}^{(i)}}(dx_{0:t-1}).$$

Particle filters combine sequential importance sampling with a selection scheme
designed to obtain $N$ new particles $\{x_{0:t}^{(i)}\}_{i=1}^{N}$ distributed approximately
according to $p(dx_{0:t} \mid h_{1:t}, a_{1:t})$.

5.1 Importance sampling step 

The joint distributions $p(dx_{0:t-1} \mid h_{1:t-1}, a_{1:t-1})$ and $p(dx_{0:t} \mid h_{1:t}, a_{1:t})$
are of different dimension. We first modify and extend the current paths $x_{0:t-1}^{(i)}$ to
obtain new paths $\tilde{x}_{0:t}^{(i)}$ using a proposal kernel
$q_t(d\tilde{x}_{0:t} \mid x_{0:t-1}, h_{1:t}, a_{1:t})$. As our goal is to design a sequential
procedure, we set

$$q_t(d\tilde{x}_{0:t} \mid x_{0:t-1}, h_{1:t}, a_{1:t}) = \delta_{x_{0:t-1}}(d\tilde{x}_{0:t-1})\, q_t(d\tilde{x}_t \mid x_{0:t-1}, h_{1:t}, a_{1:t}),$$

that is $\tilde{x}_{0:t} = (x_{0:t-1}, \tilde{x}_t)$. The aim of this kernel is to obtain new paths
whose distribution

$$q_t(d\tilde{x}_{0:t} \mid h_{1:t}, a_{1:t}) = p(dx_{0:t-1} \mid h_{1:t-1}, a_{1:t-1})\, q_t(d\tilde{x}_t \mid x_{0:t-1}, h_{1:t}, a_{1:t})$$

is as "close" as possible to $p(d\tilde{x}_{0:t} \mid h_{1:t}, a_{1:t})$. Since we cannot choose
$q_t(d\tilde{x}_{0:t} \mid h_{1:t}, a_{1:t}) = p(d\tilde{x}_{0:t} \mid h_{1:t}, a_{1:t})$, because this is
the quantity we are trying to approximate in the first place, it is necessary to weight
the new particles so as to obtain consistent estimates. We perform this "correction"
with importance sampling, using the weights

$$\tilde{w}_t = w_{t-1}\, \frac{p(h_t \mid \tilde{x}_t, a_t)\, p(d\tilde{x}_t \mid x_{0:t-1}, a_{t-1})}{q_t(d\tilde{x}_t \mid x_{0:t-1}, h_{1:t}, a_{1:t})}.$$

The choice of the transition prior as proposal distribution is by far the most common
one. In this case, the importance weights reduce to the expression for the likelihood.
However, it is possible to construct better proposal distributions, which make
use of more recent observations, using object detectors (Okuma et al., 2004), saliency
maps (Itti et al., 1998), optical flow, and approximate filtering methods such as the
unscented particle filter. One could also easily incorporate strategies to manage data
association and other tracking related issues. After normalizing the weights,
$w_t^{(i)} = \tilde{w}_t^{(i)} / \sum_{j=1}^{N} \tilde{w}_t^{(j)}$, we obtain the following estimate of
the filtering distribution:

$$\tilde{p}(dx_{0:t} \mid h_{1:t}, a_{1:t}) = \sum_{i=1}^{N} w_t^{(i)}\, \delta_{\tilde{x}_{0:t}^{(i)}}(dx_{0:t}).$$
Finally a selection step is used to obtain an "unweighted" approximate empirical
distribution $\hat{p}(dx_{0:t} \mid h_{1:t}, a_{1:t})$ of the weighted measure
$\tilde{p}(dx_{0:t} \mid h_{1:t}, a_{1:t})$. The basic idea is to discard samples with small
weights and multiply those with large weights. The use of a selection step is key to
making the SMC procedure effective; see Doucet et al. (2001) for details on how to
implement this black box routine.
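
The importance sampling and selection steps can be sketched in a few lines of
Python. This is a toy bootstrap filter under stated assumptions, not the tracker
used in the paper: the state is one-dimensional, the transition prior is a
Gaussian random walk, the likelihood is Gaussian, and the name
`particle_filter_step` and its parameters are hypothetical. Because the
transition prior is used as the proposal, the weights reduce to the likelihood.

```python
import numpy as np

def particle_filter_step(particles, observation, rng,
                         motion_std=1.0, obs_std=0.5):
    """One importance sampling + selection step of a bootstrap filter.

    Assumes a 1-D Gaussian random-walk transition prior and a Gaussian
    likelihood; both are placeholders for the models in the paper.
    """
    n = len(particles)
    # Importance sampling: propose from the transition prior.
    proposed = particles + motion_std * rng.standard_normal(n)
    # With the transition prior as proposal, weights reduce to the likelihood.
    log_w = -0.5 * ((observation - proposed) / obs_std) ** 2
    w = np.exp(log_w - log_w.max())      # subtract max for numerical stability
    w /= w.sum()                         # normalize the importance weights
    estimate = np.sum(w * proposed)      # weighted posterior mean
    # Selection: resample with replacement according to the weights.
    idx = rng.choice(n, size=n, p=w)
    return proposed[idx], estimate

rng = np.random.default_rng(1)
particles = rng.normal(0.0, 2.0, size=500)
true_state = 0.0
for t in range(20):
    true_state += 0.5                                    # target drifts right
    obs = true_state + 0.5 * rng.standard_normal()
    particles, estimate = particle_filter_step(particles, obs, rng)
print("true state:", true_state, "estimate:", round(float(estimate), 2))
```

Multinomial resampling is used here because it is the simplest selection scheme;
lower-variance schemes such as systematic resampling are drop-in replacements.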



6 Experiments 

6.1 Full Information Policies 

In this section, three experiments are carried out to evaluate quantitatively and qual- 
itatively the proposed approach. The first experiment provides comparisons to other 
control policies on a synthetic dataset. The second experiment, on a similar synthetic 
dataset, demonstrates how the approach can handle large variations in scale, occlusion 
and multiple targets. The final experiment is a demonstration of tracking and
classification performance on several real videos. For the synthetic digit videos, we
trained the first-layer RBMs on the foveated images, while for the real videos we
trained factored-RBMs on foveated natural image patches (Ranzato & Hinton, 2010).



The first experiment uses 10 video sequences (one for each digit) built from the
MNIST dataset. Each sequence contains a moving digit and static digits in the
background (to create distractions). The objective is to track and recognize the moving
digit; see Figure 7. The gaze template had K = 9 gaze positions, chosen so that gaze G5
was at the center. The location of the template was initialized with optical flow.

We compare the Hedge learning algorithm against algorithms with deterministic
and random policies. The deterministic policy chooses each gaze in sequence, in
a particular pre-specified order, whereas the random policy selects a gaze uniformly at
random. We adopted the Bhattacharyya distance in the specification of the observation
model. A multi-fixation RBM was trained to map the first layer hidden units of three
consecutive time steps into a second hidden layer, and a logistic regressor was trained
to further map to the 10 digit classes. We used the transition prior as proposal for the
particle filter.
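
To make the three policies concrete, the following Python sketch implements the
standard exponential-weights form of Hedge next to the two baselines. It is an
illustration under assumptions: the learning rate, the synthetic rewards (which
make the center gaze G5 the most reliable), and all function names are
hypothetical, and the reward definition used in the paper is not reproduced here.

```python
import math
import random

def hedge_update(log_weights, rewards, learning_rate=1.0):
    """Full-information Hedge: every gaze's reward updates its weight."""
    return [lw + learning_rate * r for lw, r in zip(log_weights, rewards)]

def policy_from_log_weights(log_weights):
    """Normalize exponentiated weights into a distribution over gazes."""
    m = max(log_weights)
    unnorm = [math.exp(lw - m) for lw in log_weights]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def deterministic_gaze(t, num_gazes):
    """Baseline: cycle through the gazes in a fixed order."""
    return t % num_gazes

def random_gaze(num_gazes, rng):
    """Baseline: pick a gaze uniformly at random."""
    return rng.randrange(num_gazes)

K = 9
rng = random.Random(0)
log_weights = [0.0] * K
for t in range(50):
    # Hypothetical rewards in [0, 1]; gaze index 4 (G5) is the most reliable.
    rewards = [0.9 if k == 4 else 0.3 * rng.random() for k in range(K)]
    log_weights = hedge_update(log_weights, rewards)
policy = policy_from_log_weights(log_weights)
gaze = rng.choices(range(K), weights=policy)[0]   # sample a gaze from the policy
print("most probable gaze: G%d" % (policy.index(max(policy)) + 1))
```

Working with log weights avoids overflow when rewards accumulate over long
sequences.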



Tables 1 and 2 report the comparison results. Tracking accuracy was measured
Policy                 0        1        2        3        4        5        6        7        8        9     AVG.
Learned              1.2      3.0      2.9      2.2      1.0      1.8      3.8      3.8      1.5      3.8      2.5
                   (1.2)    (2.0)    (1.0)    (0.7)    (1.9)    (1.9)    (1.0)    (1.5)    (1.7)    (2.8)    (1.6)
Deterministic       18.2    536.9    104.4      2.9    201.3      4.6      5.6     64.4    142.0    144.6    122.5
                  (29.6)  (395.6)   (69.7)    (2.2)  (113.4)    (4.0)    (3.1)   (45.3)  (198.8)  (157.7)  (101.9)
Random              41.5    410.7      3.2      3.3     42.8      6.5      5.7     80.7     38.9    225.2     85.9
                  (54.0)  (329.4)    (2.0)    (2.4)   (60.9)    (9.6)    (3.2)   (48.6)   (50.6)  (241.6)   (80.2)

Table 1: Tracking error (in pixels) on several video sequences using different policies for gaze
selection. Columns are indexed by the tracked digit; standard deviations are in brackets.








Policy                 0        1        2        3        4        5        6        7        8        9     AVG.
Learned           95.62%  100.00%   99.66%   99.33%   99.66%  100.00%  100.00%   98.32%   97.98%   89.56%   98.01%
Deterministic     99.33%  100.00%   98.99%   94.95%    5.39%   98.32%    0.00%   29.63%   52.19%    0.00%   57.88%
Random            98.32%  100.00%   96.30%   99.66%   29.97%   96.30%   89.56%   22.90%   12.79%   13.80%   65.96%

Table 2: Classification accuracy on several video sequences using different policies for gaze
selection.

in terms of the mean and standard deviation (in brackets) over time of the distance
between the target ground truth and the estimate, measured in pixels. The analysis
highlights that the mean error of the learned policy is below that of the other policies
on every sequence. In most of the experiments, the tracker fails under the deterministic
and random policies when an occlusion occurs, while the learned policy is successful.
This is very clear in the videos at: http://www.youtube.com/user/anonymousTrack
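
The error measure described above can be written as a short Python helper. This is
a sketch under assumptions: positions are taken to be 2-D pixel coordinates, the
population form of the standard deviation is used, and the function name and the
toy tracks are hypothetical.

```python
import math

def tracking_error(estimates, ground_truth):
    """Mean and standard deviation over time of the per-frame pixel distance."""
    dists = [math.hypot(e[0] - g[0], e[1] - g[1])
             for e, g in zip(estimates, ground_truth)]
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)   # population variance
    return mean, math.sqrt(var)

# Toy tracks: the estimate is off by 5 pixels in the second frame only.
estimates = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
ground_truth = [(0.0, 0.0), (0.0, 0.0), (6.0, 8.0)]
mean, std = tracking_error(estimates, ground_truth)
print("mean error: %.3f, std: %.3f" % (mean, std))
```

A single lost track inflates both numbers at once, which is why the failing
policies in Table 1 show large means together with large standard deviations.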

The loss of track for the simple policies is mirrored by the high variance results in
Table 1 (experiments 0, 1, 4, and so on). The average mean and standard deviations
(last column of Table 1) make it clear that the proposed strategy for learning a gaze
policy can be of enormous benefit. The improvements in tracking performance are
mirrored by improvements in classification performance (Table 2).

Figure 7 provides further anecdotal evidence for the policy learning algorithm. The
top sequence shows the target and the particle filter estimate of its location over time.
The middle sequence illustrates how the policy changes over time. In particular, it
demonstrates that Hedge can effectively learn where to look in order to improve tracking
performance (we chose this simple example as in this case it is obvious that the center
of the eight (G5) is the most reliable gaze action). The classification results over time
are shown in the third row.
Figure 7: Tracking and classification accuracy results with the learned policy. First row:
position of the target and estimate over time. Second row: policy distribution over the 9 gazes;
Hedge clearly converges to the most reasonable policy. Third row: cumulative class distribution
for recognition.

The second experiment addresses a similar video sequence, but with multiple targets
to track. The image scale of each target changes significantly over time, so the algorithm
has to be invariant with respect to these scale transformations. In this case, we used a
mixture proposal distribution consisting of motion detectors and the transition prior. We
also tested a saliency proposal but found it to be less effective than the motion detectors
for this dataset. Figure 8 (top) shows some of the video frames and tracks. The videos
allow one to better appreciate the performance of the multi-target tracking algorithm in
the presence of occlusions.

Tracking and classification results for the real videos are shown in Figure 8 and the
accompanying videos.

6.2 Partial Information Policies 

In this section, two experiments are carried out to evaluate the performance of the dif- 
ferent gaze selection policies. 

In the first experiment we compare the performance of each gaze selection method 
on a data set of several videos of digits from the MNIST data set moving on a black 
background. The target in each video encounters one or more partial occlusions which 
the tracking algorithm must handle gracefully. Additionally, each video sequence has 
Figure 8: Top: Multi-target tracking with occlusions and changes in scale on a synthetic video.
Middle and bottom: Tracking in real video sequences.








Method              0          1          2          3          4          5          6          7          8          9        Avg
Bayesopt         5.36       7.92       2.62       4.05       1.70       8.31       4.94      12.09       1.52       9.06       5.76
               (2.32)     (2.52)     (3.89)     (1.67)     (5.10)     (3.35)     (2.28)     (3.53)     (2.76)     (1.66)     (2.91)
Hedge            2.97       3.20       2.97       2.92       3.14       2.96       2.86       2.98       2.81       3.15       3.00
               (1.56)     (2.19)     (1.99)     (2.00)     (1.80)     (2.08)     (1.96)     (1.76)     (1.64)     (3.73)     (2.07)
EXP3             3.18       3.03      65.46      91.81       2.62       7.20      67.54       2.97       3.06      77.01      32.39
               (5.05)    (10.08)  (3212.16)  (3671.66)     (2.35)   (303.29)  (2346.82)     (3.99)     (2.71)  (3135.17)  (1269.33)

Table 3: Tracking error on several video sequences using different methods for gaze selection.
The table shows mean tracking error as well as the error variance (in brackets) over a single
test sequence.

been corrupted with 30% noise. We measure the error between the estimated track
and the ground truth for each gaze selection method, and demonstrate that Bayesian
optimization performs comparably to Hedge, but that EXP3 is not able to reach a
satisfactory level of performance. We also demonstrate qualitatively that the Bayesian
optimization approach learns good gaze selection policies on this data set.

Our second experiment provides evidence that the Bayesian optimization method 
can generalize to real world data. 

Table 3 reports the results from our first experiment. The table shows the mean
tracking error, measured by averaging the distance between the estimated and ground
truth track over the entire video sequence. Here we see that the Bayesian optimization
approach compares favorably to Hedge in terms of tracking performance (average error
5.76 versus 3.00), and that EXP3 performs substantially worse than the other two
methods (average error 32.39). Although Hedge performs marginally better than Bayesian
optimization, it is important to remember that Bayesian optimization solves a
significantly more difficult problem. Hedge relies on discretizing the action space, and
must have access to the rewards for all possible actions at each time step. In contrast,
Bayesian optimization considers a fully continuous action space, and receives reward
information only for the chosen actions.
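
The difference between full and partial information is easiest to see in the
update rule. The Python sketch below shows the textbook form of EXP3, which keeps
a discrete action set but, unlike Hedge, sees only the reward of the action it
chose and compensates with an importance-weighted reward estimate. The
exploration rate `gamma`, the Bernoulli rewards and the function names are
assumptions for illustration; the exact EXP3 configuration used in the
experiments is not reproduced here.

```python
import math
import random

def exp3_probabilities(weights, gamma):
    """Mix the normalized weights with uniform exploration."""
    total = sum(weights)
    k = len(weights)
    return [(1.0 - gamma) * w / total + gamma / k for w in weights]

def exp3_update(weights, probs, action, reward, gamma):
    """Update only the chosen action, using an importance-weighted reward."""
    k = len(weights)
    estimated = reward / probs[action]          # unbiased reward estimate
    new_weights = list(weights)
    new_weights[action] *= math.exp(gamma * estimated / k)
    return new_weights

rng = random.Random(0)
K, gamma = 9, 0.1
weights = [1.0] * K
for t in range(2000):
    probs = exp3_probabilities(weights, gamma)
    action = rng.choices(range(K), weights=probs)[0]
    # Hypothetical Bernoulli rewards: gaze index 4 pays off most often.
    reward = 1.0 if rng.random() < (0.9 if action == 4 else 0.2) else 0.0
    weights = exp3_update(weights, probs, action, reward, gamma)
    top = max(weights)
    weights = [w / top for w in weights]        # rescale to avoid overflow
probs = exp3_probabilities(weights, gamma)
print("probability of gaze G5:", round(probs[4], 3))
```

Dividing by a small selection probability makes the reward estimate noisy, and
every exploratory action is a real fixation that the tracker has to live with.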




Figure 9: Top: Digit templates with the estimated reward surfaces superimposed. Markers
indicate the best fixation point found in each of ten runs. Bottom: A visualization of the image
found by averaging the best fixation points found across ten runs.



Figure 9 shows the reward surfaces learned for each digit by Bayesian optimization,
as well as a visualization of the overall best fixation points using data aggregated across 
ten runs. The optimal fixation points found by the algorithm are tightly clustered, and 
the resulting observations are very distinguishable. 



In our second experiment we use the YouTube celebrity dataset from Kim et al.
(2008). This data set consists of several videos of celebrities taken from YouTube and
is challenging for tracking algorithms as the videos exhibit a wide variety of
illuminations, expressions and face orientations. We run our tracking model using Bayesian
optimization to learn a gaze selection policy on this data set, and present some results in
Figure 10. Although we report only qualitative results from this experiment, it provides
anecdotal evidence that Bayesian optimization is able to form a good gaze selection
policy on real world data.
Figure 10: Results on a real data set. Far left: An example frame from the video sequence.
Center left: The tracking template with the optimal fixation window highlighted. Center right: 
The reward surface produced by Bayesian optimization. The white markers show the centers of 
each fixation point in a single tracking run. Right: Input to the observation model when fixating 
on the best point. (Best viewed from a distance). 

7 Conclusions and Future Work 



We have proposed a decision-theoretic probabilistic graphical model for joint
classification, tracking and planning. The experiments demonstrate the significant potential
of this approach. We examined several different strategies for gaze control in both the
full and partial information settings. We saw that a straightforward generalization of
the full information policy to partial information gave poor performance, and we
proposed an alternative method which not only performs well in the presence of
partial information but also allows us to expand the set of possible fixation points to
a continuous domain.

There are many routes for further exploration. In this work we pre-trained the
(factored)-RBMs. However, existing particle filtering and stochastic optimization
algorithms could be used to train the RBMs online. Following the same methodology,
we should also be able to adapt and improve the target templates and proposal
distributions over time. This is essential to extend the results to long video sequences
where the object undergoes significant transformations (e.g. as is done in the Predator
tracking system (Kalal et al., 2010)).

Deployment to more complex video sequences will require more careful and thought- 
ful design of the proposal distributions, transition distributions, control algorithms, tem- 
plate models, data-association and motion analysis modules. Fortunately, many of the 
solutions to these problems have already been engineered in the computer vision, track- 
ing and online learning communities. Admittedly, much work remains to be done. 
Saliency maps are ubiquitous in visual attention studies. Here, we simply used
standard saliency tools and motion flow in the construction of the proposal distributions
for particle filtering. There might be better ways to exploit the saliency maps, as
neurophysiological experiments seem to suggest (Gottlieb et al., 1998).



One of the most interesting avenues for future work is the construction of more 
abstract attentional strategies. In this work, we focused on attending to regions of the 
visual field, but clearly one could attend to subsets of receptive fields or objects in the 
deep appearance model. 

The current model has no ability to recover from a tracking failure. It may be 
possible to use information from the identity pathway (i.e. the classifier output) to detect 
and recover from tracking failure. 

A closer examination of the exploration/exploitation tradeoff in the tracking setting 
is in order. For instance, the methods we considered assume that future rewards are 
independent of past actions. This assumption is clearly not true in our setting, since 
choosing a long sequence of very poor fixation points can lead to tracking failure. We 
can potentially solve this problem by incorporating the current tracking confidence into 
the gaze selection strategy. This would allow the exploration/exploitation tradeoff to
be explicitly modulated by the needs of the tracker, e.g. after choosing a poor fixation 
point the selection policy could be adjusted temporarily to place extra emphasis on ex- 
ploiting good fixation points until confidence in the target location has been recovered. 
Contextual bandits provide a framework for integrating and reasoning about this type 
of side-information in a principled manner. 